Comparing variety corpora with Vis-À-Vis - A prototype system presentation

Author

  • Stefanie Anstein
Abstract

This paper presents the prototype system Vis-À-Vis, which supports linguists in comparing regional language varieties. Written corpora are used as an empirical basis to extract differences semi-automatically. For the analysis, existing and adapted as well as new tools with both pattern-based and statistical approaches are applied. The processing of the corpus input consists of the annotation of the data, the extraction of phenomena from different levels of linguistic description, and their quantitative comparison to identify phenomena that differ significantly between the two input corpora. Vis-À-Vis produces sorted 'candidate' lists for peculiarities of varieties by filtering according to statistical association measures and by using corpus-external knowledge to reduce the output to presumably significant phenomena. Traditional regional variety linguists benefit from these results by using them as a compact empirical basis – extracted from large amounts of authentic data – for their detailed qualitative analyses. Via a user-friendly application of a comprehensive computational system, they are supported in efficiently extracting differences between varieties, e.g. for documentation, lexicography, or didactics of pluri-centric languages.

1 Background and related work

Pluri-centric languages are languages with more than one national center and with specific national varieties (Clyne, 1992). The latter usually differ to a certain extent on different levels of linguistic description, mostly on the lexical level – an example of variants in German being Marille (used in Austria and South Tyrol) vs. Aprikose (used in Germany and Switzerland) for 'apricot'. The question to be answered in the framework research project is to what extent the comparison of varieties, as support for variety linguists' manual analyses, can be automated with natural language processing (NLP) methods.
The analysis results obtained with such computational systems will contribute to variety documentation, lexicography, and language didactics. For the time being, Vis-À-Vis has been developed for the case of the pluri-centric language German (Ammon, 1995); its development originated in the initiatives Korpus Südtirol [1] and C4 [2]. The former is preparing a written text corpus of South Tyrolean German [3] (Anstein et al., 2011), which can also be queried together with other German variety corpora with the help of the distributed query engine implemented in the C4 project (Dittmann et al., 2012). In addition to running single queries interactively in the C4 corpora, variety linguists can use Vis-À-Vis to analyse and compare corpora exploratively and empirically on the desired levels of linguistic description. This is especially relevant since the amount of electronically available data constantly increases and can no longer be handled purely manually. The benefit of supportive tools from the NLP community for em...

[1] http://www.korpus-suedtirol.it
[2] http://www.korpus-c4.org
[3] South Tyrolean German is the German variety used as an official language in the Autonomous Province of Bolzano / South Tyrol in Northern Italy (Egger and Lanthaler, 2001).
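The quantitative comparison step described in the abstract (scoring phenomena by a statistical association measure and keeping only the presumably significant ones as a sorted candidate list) can be illustrated with a minimal sketch. The abstract does not say which association measures Vis-À-Vis uses, so the simplified two-term log-likelihood keyness score below is an assumption chosen for illustration, as are the function names `log_likelihood` and `candidate_list`, the plain token lists as input, and the threshold of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom).

```python
import math
from collections import Counter


def log_likelihood(a, b, total_a, total_b):
    """Simplified two-term log-likelihood (G2) score for one item that
    occurs a times in corpus A and b times in corpus B."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2.0 * g2


def candidate_list(tokens_a, tokens_b, threshold=3.84):
    """Return (word, freq_a, freq_b, score) rows for words whose score
    reaches the threshold, sorted by descending score.

    Hypothetical helper, not the Vis-À-Vis implementation."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scored = []
    for word in set(freq_a) | set(freq_b):
        score = log_likelihood(freq_a[word], freq_b[word], total_a, total_b)
        if score >= threshold:
            scored.append((word, freq_a[word], freq_b[word], score))
    return sorted(scored, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    # Toy corpora built around the Marille / Aprikose example from the text.
    south_tyrol = ["Marille"] * 10 + ["und"] * 90
    germany = ["Aprikose"] * 10 + ["und"] * 90
    for word, fa, fb, score in candidate_list(south_tyrol, germany):
        print(f"{word}\t{fa}\t{fb}\t{score:.2f}")
```

In this toy setting the shared function word receives a score of zero and is filtered out, while the two variety-specific words remain as candidates. A real system would operate on annotated corpora and on more than the lexical level, as the abstract states.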

Similar resources

Assessment of the Awareness and Practice of Women vis-à-vis Breast Self-Examination in Fasa in 2011

Background & Objective: Breast cancer is one of the most important causes of women's mortality worldwide. Breast self-examination (BSE) is a method that often leads to the detection of breast cancer at an early stage. This study aimed at assessing the awareness and practice of women in the city of Fasa vis-à-vis BSE. Materials & Methods: In this descriptive-analytical study, 300 women over 15 yea...

Authenticating ‘Cover to Cover’ Reader Series vis-à-vis Cultural Norms for the Iranian community

This research study was an attempt to explore hidden cultural components in an ELT textbook from Oxford University Press (OUP) titled 'Cover to Cover'. Two research methodologies were relied on to unveil the western ideologies in this series. First, a qualitative review of its reading textbooks was undertaken to authenticate the hidden western values for Iranian contexts. At this stage, ...

Bilingual Terminology Mining - Using Brain, not brawn comparable corpora

Current research in text mining favours the quantity of texts over their quality. But for bilingual terminology mining, and for many language pairs, large comparable corpora are not available. More importantly, as terms are defined vis-à-vis a specific domain with a restricted register, it is expected that the quality rather than the quantity of the corpus matters more in terminology mining. Ou...

Design and Evaluation in Visualization Research

Until very recently, the emphasis in Visualization research has been on methods, their algorithmic underpinnings, and their implementation in systems. Most papers have been of the proof-of-concept variety: describing new ideas for attacking a visualization problem and demonstrating feasibility and quality by presenting visual and performance results from a prototype implementation. Typical publ...

A Biarticulated Robotic Leg for Jumping Movements: Theory and Experiments

This paper investigates the extent to which biarticular actuation mechanisms (spring-driven redundant actuation schemes that extend over two joints, similar in function to biarticular muscles found in legged animals) improve the performance of jumping and other fast explosive robot movements. Robust numerical optimization algorithms that take into account the complex dynamics of both the redundant...

Publication date: 2012